Data Science — Urban Forest Risk Assessment - Sprint 1 complete - Sprint 2 In Progress (20%)#1721
Conversation
Initial setup: - "playground" folder organisation - requirements file for virtual environment - notebook test (to check vscode config)
EDA for all datasets being used. gitignore for data and venv files
All datasets cleaned and CRS aligned
Trees linked to nearest microclimate and soil sensors
- Engineered weather features (rolling averages, drought, heatwave) - Assembled everything into a feature table for ML model next
- Initial setup - Data preparation for ML risk scoring
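The weather features mentioned above (rolling averages plus a heatwave flag from consecutive hot days) could be sketched with pandas along these lines. This is a hedged illustration, not the PR's actual code: the column names (`tmax`, `tmax_7d_mean`, `heatwave`), the 35 °C threshold, and the 3-day streak rule are all placeholder assumptions.

```python
import pandas as pd

# Toy daily max-temperature series standing in for the real weather data.
rng = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"date": rng,
                   "tmax": [29, 31, 33, 36, 37, 35, 28, 27, 30, 38]})

# Rolling average: 7-day mean of daily maximum temperature.
df["tmax_7d_mean"] = df["tmax"].rolling(window=7, min_periods=1).mean()

# Heatwave flag: day on which a run of hot days (>= 35 °C, threshold assumed)
# reaches 3 consecutive days.
hot = (df["tmax"] >= 35).astype(int)
# Restart the running count each time a non-hot day breaks the streak.
streak = hot.groupby((hot == 0).cumsum()).cumsum()
df["heatwave"] = streak >= 3

print(df[["date", "tmax", "tmax_7d_mean", "heatwave"]])
```

The `groupby(...).cumsum()` trick counts consecutive hot days without an explicit loop, which keeps the feature vectorised over the full weather history.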
manya0033
left a comment
There was a problem hiding this comment.
Hey @aidanuni, thanks for walking me through Sprint 1 - the structure is clean and I can see a lot of thought went into the feature engineering (rolling temp averages, heatwave flags; the consecutive-hot-days logic is a nice touch). Before I approve, a few things to address:
The CRS warning in 03_spatial_joins.ipynb is still showing because EPSG:7844 is actually a geographic CRS (GDA2020 lat/lon), not a projected one. You can see the effect in the distance stats: the column is labelled sensor_distance_m, but the values range from 0.00002 to 0.05, which can't be metres for 82k trees - they're still in degrees. Could you switch to a projected CRS like EPSG:7855 (GDA2020 / MGA zone 55) before the sjoin_nearest calls? That'll give you genuine metre distances and the warning will go away.
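Roughly what I mean - a minimal sketch with toy GeoDataFrames (the layer/column names here are stand-ins, not the PR's actual ones):

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy layers in EPSG:7844 (GDA2020 geographic, degrees) standing in for the
# real tree and sensor data.
trees = gpd.GeoDataFrame(
    {"tree_id": [1, 2]},
    geometry=[Point(144.9631, -37.8136), Point(144.9700, -37.8100)],
    crs="EPSG:7844",
)
sensors = gpd.GeoDataFrame(
    {"sensor_id": ["a"]},
    geometry=[Point(144.9650, -37.8120)],
    crs="EPSG:7844",
)

# Reproject BOTH layers to EPSG:7855 (GDA2020 / MGA zone 55, metres)
# before the nearest join, so distance_col is in genuine metres.
trees_m = trees.to_crs(epsg=7855)
sensors_m = sensors.to_crs(epsg=7855)

joined = gpd.sjoin_nearest(trees_m, sensors_m,
                           distance_col="sensor_distance_m")
print(joined[["tree_id", "sensor_id", "sensor_distance_m"]])
```

With the reprojection in place the distances come out in the hundreds of metres for points a few thousandths of a degree apart, which is the sanity check that the warning was pointing at.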
The notebooks reference local paths like ../data/processed/feature_table.csv but I can't see where the raw data is actually coming from. If you're pulling the datasets from the Melbourne Open Data portal via API v2.1, could you include those API calls directly in the notebook so reviewers can reproduce the pipeline end-to-end? Otherwise, if any of the data is external, a CSV version should go in the DEPENDENCIES folder as per the checklist.
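For the reproducibility point, something like the following would let readers pull the raw CSVs straight from the portal. This is a sketch, not tested against your datasets: the portal is an Opendatasoft instance, so I'm assuming its Explore API v2.1 CSV export endpoint; the dataset id and the `;` delimiter are placeholders you'd need to confirm.

```python
import pandas as pd

# Base of the Melbourne Open Data portal's Explore API v2.1 (Opendatasoft).
BASE = "https://data.melbourne.vic.gov.au/api/explore/v2.1"

def export_csv_url(dataset_id: str) -> str:
    """Build the v2.1 CSV export URL for a dataset."""
    return f"{BASE}/catalog/datasets/{dataset_id}/exports/csv"

def fetch_dataset(dataset_id: str) -> pd.DataFrame:
    """Download a full dataset as a DataFrame (needs network access).
    Opendatasoft CSV exports typically use ';' as the delimiter - adjust
    if the portal is configured differently."""
    return pd.read_csv(export_csv_url(dataset_id), sep=";")

# Placeholder dataset id - substitute the real ids the pipeline uses.
url = export_csv_url("trees-urban-forest")
print(url)
```

Putting these calls at the top of the notebook (with the real dataset ids) means no local `../data/` files are required to reproduce the pipeline.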
One suggestion: would it be possible to consolidate everything into a single notebook rather than five separate ones? Our use cases are meant to read as step-by-step tutorials, and having the whole pipeline (exploration -> cleaning -> spatial joins -> feature engineering -> ML model) in one notebook with clear markdown headers between sections would make it much easier for readers to follow along and reproduce. It also avoids the issue of needing intermediate files saved to disk between notebooks.
A short README in your project folder would also help: just a few lines on the pipeline order, where the raw data comes from, and any setup notes.
No worries that the ML section is mostly empty; the PR title is clear that Sprint 2 is 20% in progress. Happy to re-review once those are sorted!
Layer 1 — Data Pipeline & Feature Engineering (complete)
Layer 2 — ML Risk Scoring (in progress)